Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4

Large language models (LLMs) are seen to have tremendous potential in advancing medical diagnosis recently, particularly in dermatological diagnosis, which is a very important task as skin and subcutaneous diseases rank high among the leading contributors to the global burden of nonfatal diseases. Here we present SkinGPT-4, which is an interactive dermatology diagnostic system based on multimodal large language models. We have aligned a pre-trained vision transformer with an LLM named Llama-2-13b-chat by collecting an extensive collection of skin disease images (comprising 52,929 publicly available and proprietary images) along with clinical concepts and doctors’ notes, and designing a two-step training strategy. We have quantitatively evaluated SkinGPT-4 on 150 real-life cases with board-certified dermatologists. With SkinGPT-4, users could upload their own skin photos for diagnosis, and the system could autonomously evaluate the images, identify the characteristics and categories of the skin conditions, perform in-depth analysis, and provide interactive treatment recommendations.

• The paper discusses the advantages of each step in the two-step training strategy and presents supporting evidence in Figure 3 and Section 2.3.
• Previous works in related areas using LLMs have some limitations due to privacy issues. For example, ChatCAD uses the ChatGPT API, which requires users to submit sensitive text descriptions to an online API. In contrast, SkinGPT-4 overcomes this privacy concern by enabling local deployment. Based on the evaluation results provided, this feature has increased the users' willingness to use this framework.
• Compared to MiniGPT-4, SkinGPT-4 offers improved diagnosis and explanations. MiniGPT-4 was primarily pre-trained on natural images, which are irrelevant to medical images. In contrast, SkinGPT-4 performs better in this regard due to its two-step training strategy. In the initial step, SkinGPT-4 aligns visual and textual clinical concepts, enabling it to detect medical features. The second step enables SkinGPT-4 to map detected medical features to diseases (Fig 3).
• The evaluation indicates that SkinGPT-4 performs significantly better in real-life cases of various skin diseases compared to MiniGPT-4. It was assessed on 150 real-life cases, comparing its diagnosis to that of certified dermatologists.

Weaknesses
• While SkinGPT-4's diagnosis achieves an 80% agreement with certified dermatologists, it is important to consider how the framework addresses the issues of misinformation and misdiagnosis.
• The evaluation did not provide information on the demographics of the patients involved during validation. Furthermore, the real-life cases provided in the supplementary material largely belong to a specific skin color, suggesting a bias. Consequently, this evaluation fails to capture any biases in the method, which is a weakness of this paper.
• The authors emphasize that they address privacy concerns by allowing for local deployment of the framework. However, the paper does not provide information on the feasibility of such deployments, as LLMs and vision transformers are large models that require specific computational resources for full local deployment. Additionally, the usability of the framework would be limited to patients who have personal access to sufficient computational resources.
• There are ethical concerns and contradictions between data sharing and the inclusion of images from real-life cases in the supplementary sections.

Detailed Comments
1. While the novelty in the application is commendable, I couldn't identify any architectural innovations compared to BLIP-2. Are there any specific improvements in SkinGPT-4 compared to BLIP-2?
2. As mentioned in the weakness section above, could you provide more insights into how the method works on different demographics (Performance on patients with dark skin tones)? In my opinion, it is necessary to have information on the patients' demographics, as there may be biases in the method that are not captured by the current evaluation method.
3. SkinGPT-4's ability to provide preliminary diagnoses empowers patients to make informed decisions about whether to seek in-person medical attention. This capability was highlighted as a strength of the method in the paper. However, a crucial concern arises regarding the reliability of these preliminary diagnoses. For instance, in a scenario where the framework incorrectly diagnoses a skin condition, a patient might forego a doctor's visit. As time passes, the condition could deteriorate, leading to a delayed visit when treatment becomes more challenging. How does the framework intend to address such situations, ensuring the reliability of its diagnostic recommendations?
4. In light of the described scenario, how is responsibility for the situation managed?
5. Regarding the evaluations, how many certified dermatologists were involved in the evaluation and comparison process? What type of evaluation was conducted? Were the evaluators aware that the generations are from LLMs?
6. In Fig 4a, what does the barplot indicate? Is it the percentage of skin disease cases that the dermatologists agree on, or is it the percentage of dermatologists?
7. When it comes to local deployment, how practical is it? Does it imply that the usability of the framework is restricted to patients who have access to sufficient computational resources? In my opinion, adding more information on local deployment would enhance the quality of the paper.
8. Regarding the supplementary section, is it ethical to share images of real-life cases when the identities of the patients are visible in the images? For example, cases like rosacea4, rosacea10, lupus1 etc. During a presentation, would it be better to blur certain areas in the images to preserve privacy?
9. Additionally, an ethical statement regarding the real-life cases used for evaluation is not provided. Furthermore, was the patient's consent obtained before including the images in the supplementary section?

Note on reproducibility: We attempted to run the code provided in the GitHub repository but encountered some issues during the installation and in the demo.py file.

Reviewer #2 (Remarks to the Author):
In Zhou*, He*, Sun* et al, the authors establish a generalized framework for machine learning-guided digital diagnosis of skin diseases spanning rashes as well as neoplasms. A significant innovation in this paper is the integration of deep learning models for both quantitative image analysis as well as natural language processing using LLMs to accurately diagnose clinical cases in dermatology. Another important component of the article is validation using clinical input from board certified dermatologists. The article is well-written and the figures are well made. I think this article is of significant general interest as well as of interest to the dermatology community and could be a worthwhile contribution to the literature. However, there are several major comments I would like addressed in a revised manuscript before I endorse publication.
months, there has been rapid progress in the development of large language models (LLMs), leading to the emergence of updated and improved models within the domain of open-source LLMs. To ensure the optimal performance of our approach and minimize computational resource requirements, we have made updates by replacing the previously used LLM (Falcon-40B-Instruct) with a smaller but more easily deployable LLM. This new LLM, recognized as one of the state-of-the-art models according to several leaderboards, not only facilitates local deployment but also contributes to the overall efficiency of our method.
We tested the effect of the replacement on multiple cases and are happy to conclude that it significantly reduces the required computational resource for local deployment (less GPU memory and computing time) and slightly increases performance on the diagnostic capabilities.

This interactive framework enhances patient communication and education by allowing users to upload their own skin photos and receive information about their condition as well as treatment recommendations.

Responses to
The system underwent qualitative evaluation by certified dermatologists using 150 real-life cases. Furthermore, in order to prioritize privacy preservation, this framework allows for local deployment.

Strengths
• The use of multi-modal LLMs to enhance diagnosis in the field of dermatology indicates the novelty of this method in terms of its applications. These systems aim to address the shortage of dermatologists, improve interpretations of skin disease images, and provide patient-friendly diagnosis reports.
• In addition to the novelty of the application, the proposed two-step training strategy further enhances the method's novelty by enabling the adaptation of multimodal LLMs for specific domains.
• The paper discusses the advantages of each step in the two-step training strategy and presents supporting evidence in Figure 3 and Section 2.3.
• Previous works in related areas using LLMs have some limitations due to privacy issues. For example, ChatCAD uses the ChatGPT API, which requires users to submit sensitive text descriptions to an online API. In contrast, SkinGPT-4 overcomes this privacy concern by enabling local deployment. Based on the evaluation results provided, this feature has increased the users' willingness to use this framework.
• Compared to MiniGPT-4, SkinGPT-4 offers improved diagnosis and explanations. MiniGPT-4 was primarily pre-trained on natural images, which are irrelevant to medical images. In contrast, SkinGPT-4 performs better in this regard due to its two-step training strategy. In the initial step, SkinGPT-4 aligns visual and textual clinical concepts, enabling it to detect medical features. The second step enables SkinGPT-4 to map detected medical features to diseases (Fig 3).
• The evaluation indicates that SkinGPT-4 performs significantly better in real-life cases of various skin diseases compared to MiniGPT-4. It was assessed on 150 real-life cases, comparing its diagnosis to that of certified dermatologists.

Weaknesses
• While SkinGPT-4's diagnosis achieves an 80% agreement with certified dermatologists, it is important to consider how the framework addresses the issues of misinformation and misdiagnosis.
• The evaluation did not provide information on the demographics of the patients involved during validation. Furthermore, the real-life cases provided in the supplementary material largely belong to a specific skin color, suggesting a bias. Consequently, this evaluation fails to capture any biases in the method, which is a weakness of this paper.
• The authors emphasize that they address privacy concerns by allowing for local deployment of the framework. However, the paper does not provide information on the feasibility of such deployments, as LLMs and vision transformers are large models that require specific computational resources for full local deployment. Additionally, the usability of the framework would be limited to patients who have personal access to sufficient computational resources.
• There are ethical concerns and contradictions between data sharing and the inclusion of images from real-life cases in the supplementary sections.

AR: We would like to express our heartfelt gratitude to you for dedicating valuable time to review our manuscript and for providing us with precious and constructive feedback. We recognize the importance of such feedback and sincerely appreciate your thoughtful comments and suggestions. We have revised our paper by following all of your comments. For the listed weaknesses, please refer to our detailed responses to the following major concerns.

Major concern #1
RC: 1. While the novelty in the application is commendable, I couldn't identify any architectural innovations compared to BLIP-2. Are there any specific improvements in SkinGPT-4 compared to BLIP-2?
AR: We appreciate your thorough review and valuable comments on our methodology. There are three major differences between SkinGPT-4 and BLIP-2:
1. Architectural differences: SkinGPT-4 adopts a larger state-of-the-art LLM compared to BLIP-2, which utilizes smaller LLMs (FlanT5 and OPT). This architectural modification is aimed at improving conversational capabilities and enhancing the model's performance in generating coherent responses to user queries.
2. Training strategy: The training strategy in BLIP-2 was identified by previous studies as potentially insufficient to align Vision Transformers (ViT) with LLMs, impacting text generation ability. In response to this, SkinGPT-4 employs a two-step training strategy as detailed in the revised manuscript. Our two-step training strategy is designed to compel the model to learn medical features from skin disease images and generate coherent text responses, thereby improving overall usability.
3. Q-Former fine-tuning: Unlike BLIP-2, we did not fine-tune the Q-Former in SkinGPT-4. Instead, we directly used the Q-Former aligned with FlanT5 from BLIP-2 and trained a single projection layer. This simpler linear layer proves sufficient for enabling the LLM to understand and generate relevant information under the context of skin disease images.
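As a rough illustration of how small the trainable part is when only a single projection layer is trained, the sketch below counts the parameters of one linear layer mapping Q-Former outputs into the LLM input space. The two dimensions (768 and 5120) are assumptions based on the publicly released BLIP-2 and Llama-2-13b configurations, not values stated in this response, and the function name is hypothetical.

```python
def linear_projection_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Count trainable parameters of one linear layer: weight matrix plus optional bias."""
    if in_dim <= 0 or out_dim <= 0:
        raise ValueError("dimensions must be positive")
    return in_dim * out_dim + (out_dim if bias else 0)


# Assumed dimensions (not stated in the response text):
QFORMER_OUTPUT_DIM = 768      # assumed width of each Q-Former query output
LLM_HIDDEN_DIM = 5120         # assumed hidden size of a 13B Llama-2 model
LLM_NOMINAL_PARAMS = 13e9     # nominal size implied by the "13b" in the model name

trainable = linear_projection_params(QFORMER_OUTPUT_DIM, LLM_HIDDEN_DIM)
fraction = trainable / LLM_NOMINAL_PARAMS

print(f"trainable projection parameters: {trainable:,}")
print(f"share of the nominal LLM size:   {fraction:.4%}")
```

Under these assumed dimensions the projection holds a few million parameters, several orders of magnitude fewer than the frozen LLM, which is one reason such an alignment layer is inexpensive to train.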

Major concern #2
RC: 2. As mentioned in the weakness section above, could you provide more insights into how the method works on different demographics (Performance on patients with dark skin tones)? In my opinion, it is necessary to have information on the patients' demographics, as there may be biases in the method that are not captured by the current evaluation method.
AR: Thank you for the valuable comment. We appreciate the observation regarding the absence of demographic information and the potential bias in our results.
The reason for not including demographics is the lack of such information in our dataset. Both the public dataset and our proprietary data are completely anonymized, with sensitive information such as patient gender, age, name, and nationality removed to meet the requirements of privacy laws, such as Biosecurity Law in China and HIPAA in the USA. While we are unable to present statistical results for demographics, visual inspection indicates that the majority of our samples fall within Fitzpatrick I-IV skin tones. Notably, data for Fitzpatrick V-VI skin tones (dark skin tones) is underrepresented in our dataset, limiting the model's effectiveness in diagnosing skin diseases in individuals with Fitzpatrick V-VI. Therefore, we did not focus on the diagnosis of Fitzpatrick V-VI skin tones in this study due to the following two reasons:
1. Limited research and commonly recognized difficulties in the diagnosis of skin disease for Fitzpatrick V-VI skin tones: Current studies on Fitzpatrick V-VI skin tones are relatively scarce, and state-of-the-art dermatological AI algorithms exhibit notably lower performance for lesions on dark skin compared to their efficacy for light-colored skin, especially in cases confirmed by biopsies. The primary challenge lies in the less conspicuous early characteristics of skin diseases for Fitzpatrick V-VI patients, making diagnosis more challenging. Consequently, individuals with darker skin tones often receive diagnoses at later stages, leading to increased morbidity, mortality, and associated costs.
2. Scarcity of Fitzpatrick V-VI data for model training: Existing datasets, such as the Diverse Dermatology Images (DDI) dataset, are insufficient for training deep learning models, particularly those based on large language models like SkinGPT-4. Currently, there is no Fitzpatrick V-VI data available in the required format to train SkinGPT-4.
We appreciate your observation regarding potential bias in our method, especially when it comes to diagnosing skin diseases for Fitzpatrick V-VI patients. Recognizing the importance of explicitly stating the limitations of SkinGPT-4, we have stated this concern as a key consideration for future research directions in our main text.
We value your suggestion. Therefore, in our upcoming research, we intend to gather more data specifically representing individuals with darker skin tones and develop an independent version of SkinGPT-4 tailored to this demographic. In response to this concern, we refined our contribution claim and updated the discussion section in the main text, as outlined below:
Current research on Fitzpatrick V-VI (dark skin tones) is relatively limited, and the performance of state-of-the-art dermatological AI algorithms is notably inferior for lesions on dark skin compared to their efficacy for light-colored skin, especially in cases confirmed by biopsies. The primary challenge arises from the less conspicuous early characteristics of certain dark skin diseases, leading to a more challenging diagnosis. Consequently, individuals with darker skin tones often receive diagnoses at later stages, resulting in heightened morbidity, mortality, and associated costs. Compounding this issue is the scarcity of Fitzpatrick V-VI data, such as the Diverse Dermatology Images (DDI) dataset, which is insufficient for training deep learning models, particularly those based on LLMs such as SkinGPT-4.
In this study, our dataset primarily comprises Fitzpatrick I-IV skin tones, inadvertently limiting the model's efficacy in diagnosing skin diseases in individuals with Fitzpatrick V-VI. To address this limitation, future research endeavors will involve the systematic collection of Fitzpatrick V-VI data and the targeted training of SkinGPT-4 to enhance its diagnostic capabilities for Fitzpatrick V-VI patients.

Major concern #3
RC: 3. SkinGPT-4's ability to provide preliminary diagnoses empowers patients to make informed decisions about whether to seek in-person medical attention. This capability was highlighted as a strength of the method in the paper. However, a crucial concern arises regarding the reliability of these preliminary diagnoses. For instance, in a scenario where the framework incorrectly diagnoses a skin condition, a patient might forego a doctor's visit. As time passes, the condition could deteriorate, leading to a delayed visit when treatment becomes more challenging. How does the framework intend to address such situations, ensuring the reliability of its diagnostic recommendations?
AR: We appreciate your attention to the critical concern of addressing potential misinformation and misdiagnosis associated with SkinGPT-4, particularly in the context of AI systems designed for medical diagnostics. Safeguarding patients is of utmost importance, and we are committed to addressing this challenge.
It is important to note that, in the real world, perfect AI systems do not and will not exist, just as perfect domain experts do not; errors are inevitable. Therefore, the key distinction lies in the error rate. Technically, we recognize the need for a solution that involves the implementation of a lifelong learning model in the future. This approach would allow continuous data collection and model refinement, thereby reducing the probability of errors over time, which is exactly what we are doing now.
Meanwhile, in our claim of contributions, it is crucial to note that SkinGPT-4 is not intended to replace dermatologists. Instead, it is conceived as an evolving and continuously optimized tool. SkinGPT-4 is envisioned to serve as an assistant in facilitating communication between patients and doctors. It can provide patients with additional information about skin diseases, while doctors can leverage SkinGPT-4 to aid in the diagnostic process. That is, SkinGPT-4 was meant to provide a second opinion and second voice to both patients and clinicians. SkinGPT-4 could enhance users' comprehension of their medical conditions, facilitate improved communication between patients and doctors, expedite the diagnostic process for dermatologists, help triage, and potentially promote human-centred care and healthcare equity in underdeveloped areas.
Especially in many underdeveloped areas, misdiagnosis might not be the primary concern, because the most critical problem is the shortage of dermatologists, thus many patients find it impossible to get any medical advice. For example, a shortage of dermatologists is reported for most African countries (Namibia 0•8,

Since our work is the first study of multi-modal large language models in skin disease diagnosis, we acknowledge the importance of clear disclaimers and guidance on the software page. Specifically, we include a prominent advisory stating the necessity of adhering to medical advice, and we emphasize the recommendation for users to consult with a qualified physician for specific diagnostic results whenever possible. These precautionary measures are implemented to promote responsible use and ensure that users comprehend the potential risk of SkinGPT-4 within a medical context.
We have updated the discussion section in the main text accordingly as below:
As LLM-based applications like SkinGPT-4 continue to evolve and improve with the acquisition of even more reliable medical training data, the potential for significant advancements in online medical services is enormous. SkinGPT-4 could play a critical role in improving access to healthcare and enhancing the quality of medical services for patients worldwide. It is crucial to underscore that no AI system is infallible and entirely free from misinformation and misdiagnosis. Therefore, SkinGPT-4 is not designed to replace dermatologists but rather to serve as an evolving and continuously optimized tool, functioning as an assistant in facilitating communication between patients and doctors. Our aspiration for SkinGPT-4 is to provide patients with more information about skin diseases, while also offering doctors valuable assistance in the diagnostic process. Therefore, we included clear disclaimers and guidance on the software page. This includes a prominent advisory, emphasizing the importance of adhering to medical advice, and a strong recommendation to consult with a qualified physician for specific diagnostic results. These precautionary measures are in place to encourage responsible use and ensure that users comprehend the limitations of the software in a medical context. We will continue our research in this field to further develop and refine this technology.

Major concern #4
RC: 4. In light of the described scenario, how is responsibility for the situation managed?
AR: Thank you for this important concern. We acknowledge the importance of managing responsibility in scenarios where users may make decisions based on SkinGPT-4's recommendations.
While we strive to minimize the occurrence of misinformation by continuously improving the model's accuracy and transparency, it is also crucial for users to provide informed consent before using SkinGPT-4 and be aware of the limitations and risks associated with relying on SkinGPT-4 for medical advice. As we stated in the previous response, users will be explicitly informed about the preliminary nature of the diagnoses and are encouraged to seek professional medical guidance as early as possible to ensure responsible decision-making regarding their disease. These precautionary measures are implemented to promote responsible use and ensure that users comprehend the potential risk of SkinGPT-4 within a medical context.
We added a disclaimer that is displayed before the software is used; users must give their consent to use SkinGPT-4, as shown below.

Major concern #5
RC: 5. Regarding the evaluations, how many certified dermatologists were involved in the evaluation and comparison process? What type of evaluation was conducted? Were the evaluators aware that the generations are from LLMs?
AR: Thank you for the valuable feedback. During the evaluation, we actively engaged five certified dermatologists.
As we stated in the manuscript, the outputs generated by SkinGPT-4 differ from those of traditional classification tasks. SkinGPT-4 generates textual descriptions, encompassing medical feature details and potential judgments about skin disease types. Consequently, dermatologists were required to conduct assessments from a medical perspective. Their role involved evaluating the reliability of the answers generated by SkinGPT-4 by comparing them to the diagnoses made by dermatologists. Dermatologists knew that the results were generated by SkinGPT-4. This approach ensured a comprehensive evaluation that incorporated both medical expertise and an understanding of the capabilities and limitations of LLMs in the context of dermatological diagnoses.
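Because the assessment is a rating of free-text answers rather than a classification score, the results can be summarized as the share of ratings at each level of the five-point agreement scale used in the evaluation form (strongly agree to strongly disagree). The following is a minimal sketch of such a tally; the function name and the toy ratings are hypothetical, and counting "strongly agree" and "agree" together as agreement is an assumption made for illustration, not necessarily how the reported percentages were derived.

```python
from collections import Counter

# Five-point scale named in the evaluation form.
SCALE = ("strongly agree", "agree", "neutral", "disagree", "strongly disagree")


def summarize_item(ratings):
    """Return (share of each scale level, combined agreement rate) for one form item."""
    if not ratings:
        raise ValueError("no ratings supplied")
    unknown = set(ratings) - set(SCALE)
    if unknown:
        raise ValueError(f"unexpected ratings: {sorted(unknown)}")
    counts = Counter(ratings)
    total = len(ratings)
    shares = {level: counts[level] / total for level in SCALE}
    # Assumption: "agreement" pools the two positive levels.
    agreement = (counts["strongly agree"] + counts["agree"]) / total
    return shares, agreement


# Hypothetical toy ratings for a single statement (not study data).
toy_ratings = ["strongly agree", "agree", "agree", "neutral", "disagree"]
shares, agreement = summarize_item(toy_ratings)
for level in SCALE:
    print(f"{level:>17}: {shares[level]:.0%}")
print(f"combined agreement: {agreement:.0%}")
```

Reporting the full distribution alongside the pooled rate makes it clear whether a bar in a plot refers to the share of ratings or the share of raters, which is the ambiguity raised in the reviewer's question about Fig 4a.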
We incorporated additional details in the main text as below:
To assess the reliability and effectiveness of SkinGPT-4, we curated a dataset comprising 150 real-life cases of various skin diseases as shown in Table 2. Interactive diagnosis sessions were conducted with SkinGPT-4, utilizing four specific prompts:
1. Could you describe the skin disease in this image for me?
2. Please provide a paragraph listing additional features you observed in the image.
3. Based on the previous information, please provide a detailed explanation of the cause of this skin disease.
4. What treatment and medication should be recommended for this case?
To conduct the clinical evaluation, five certified dermatologists were provided with the same set of four questions and were required to make diagnoses based on the given skin disease images. The dermatologists were then presented with the results generated by SkinGPT-4 and told that the results were generated by LLMs. The next major goal is to evaluate the usability of SkinGPT-4 by comparing the results generated by SkinGPT-4 with those evaluated by dermatologists. Then, the dermatologists evaluated the results generated by SkinGPT-4 and assigned scores (strongly agree, agree, neutral, disagree, and strongly disagree) to each item in the evaluation form (Figure 4a), including the following questions:
1. SkinGPT-4's diagnosis is correct or relevant.
5. SkinGPT-4 can help patients to understand their disease better.
7. Willingness to use SkinGPT-4.

AR: Thank you for the valuable comment regarding the feasibility and computational resource requirements for the local deployment of our framework. We acknowledge the need to provide more detailed information on the practical aspects of local deployment, including minimum hardware requirements and potential challenges associated with limited computational resources.
We understand that advanced GPU computing resources may not be easily accessible to general users. This is also one of the major reasons why we replaced the Falcon-40B-Instruct in the previous version with a smaller and more easily deployable LLM (Llama-2-13b-chat) in the revised version. To deploy SkinGPT-4 entirely locally, a Linux system (e.g. Ubuntu 18.04) is mandatory. For acceleration, we recommend using a GPU with at least 30GB of memory (e.g. NVIDIA V100). In situations where the GPU is not available, SkinGPT-4 could also run on CPUs but requires at least 30GB RAM, which is easier for the general user to meet. Many personal laptops could meet these requirements. This adaptability ensures broader accessibility and addresses concerns about limiting the framework's usability to patients with sufficient computational resources.
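A back-of-the-envelope estimate helps put the 30GB figure in context. The sketch below computes the memory needed to hold the LLM weights alone, assuming the nominal parameter counts implied by the model names (40 billion and 13 billion) and 16-bit storage (2 bytes per parameter). These are assumptions for illustration; the estimate ignores the vision encoder, activations, and any caching, so actual usage will be higher.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory to hold model weights only, in GiB (default: 16-bit, 2 bytes each)."""
    if n_params <= 0 or bytes_per_param <= 0:
        raise ValueError("inputs must be positive")
    return n_params * bytes_per_param / 2**30


# Nominal parameter counts read off the model names (assumption).
NOMINAL_PARAMS = {
    "Falcon-40B-Instruct": 40e9,
    "Llama-2-13b-chat": 13e9,
}

for name, n_params in NOMINAL_PARAMS.items():
    gib = weight_memory_gib(n_params)
    print(f"{name}: roughly {gib:.1f} GiB for 16-bit weights")
```

Under these assumptions the 13B model needs roughly 24 GiB for its weights, which is broadly in line with a recommendation of at least 30GB of memory, whereas the 40B model would need about three times as much.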
We added more details into the section "Hyperparameters and resources for model training and inference" in the main text as below:

Hyperparameters and resources for model training and inference
During the training of both steps, the max number of epochs was fixed to 20, the iteration of each epoch was set to 5000, the warmup step was set to 5000, the learning rate was set to 1e-4, and the max text length was set to 160. The entire fine-tuning process required approximately 24 hours to complete and utilized eight NVIDIA A100 (80GB) GPUs. To deploy SkinGPT-4 entirely locally, a Linux system (e.g. Ubuntu 18.04) is mandatory. For acceleration, we recommend using a GPU with at least 30GB of memory (e.g. NVIDIA V100). In situations where the GPU is not available, SkinGPT-4 could also run on CPUs but requires at least 30GB Random Access Memory (RAM). SkinGPT-4 was developed using Python 3.7, PyTorch 1.9.1, and CUDA 11.4. For a comprehensive list of dependencies, please refer to our code availability documentation.
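The hyperparameters listed above can be collected in a small configuration to make the implied training length explicit. This is a sketch only: the dictionary keys are hypothetical names rather than the keys used in the released configuration files, and the linear warmup followed by a constant rate is an assumed schedule, since the text specifies the number of warmup steps and the learning rate but not the shape of the schedule.

```python
# Values taken from the text; key names are hypothetical.
TRAIN_CONFIG = {
    "max_epochs": 20,
    "iters_per_epoch": 5000,
    "warmup_steps": 5000,
    "learning_rate": 1e-4,
    "max_text_length": 160,
}


def total_iterations(cfg) -> int:
    """Total optimizer steps per training step: epochs times iterations per epoch."""
    return cfg["max_epochs"] * cfg["iters_per_epoch"]


def lr_at_step(step: int, cfg) -> float:
    """Assumed schedule: linear ramp from 0 to the peak rate, then constant."""
    if step < 0:
        raise ValueError("step must be non-negative")
    warmup = cfg["warmup_steps"]
    if step < warmup:
        return cfg["learning_rate"] * step / warmup
    return cfg["learning_rate"]


print("iterations per training step:", total_iterations(TRAIN_CONFIG))
for step in (0, 2500, 5000, 50000):
    print(f"step {step:>6}: lr = {lr_at_step(step, TRAIN_CONFIG):.2e}")
```

With 20 epochs of 5000 iterations, each training step runs for 100,000 iterations, so the 5000 warmup steps cover the first epoch.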

Major concern #8
RC: 8. Regarding the supplementary section, is it ethical to share images of real-life cases when the identities of the patients are visible in the images? For example, cases like rosacea4, rosacea10, lupus1 etc. During a presentation, would it be better to blur certain areas in the images to preserve privacy?
AR: We acknowledge your ethical concerns regarding the inclusion of images from real-life cases in the supplementary sections. To address these concerns, we explicitly checked those images and excluded identifiable features to protect privacy. This measure aligns with ethical standards and reinforces our commitment to the responsible handling of sensitive information to meet HIPAA regulations.

6. Execute demo.py
python demo.py --cfg-path eval_configs/skingpt4_eval_llama2_13bchat.yaml --gpu-id 0
We tested the procedure above on a new Ubuntu 18.04 system and confirmed that the demo.py file executes without error. In case the reviewer encounters any other issues, please feel free to open an issue anonymously at our code repository and we will respond immediately.

We did not include those metrics in the previous manuscript because our task is more like an image captioning task than a classification task. The results generated by SkinGPT-4 are textual descriptions containing medical feature descriptions and potential judgments on skin disease. Those results could also vary according to users' questions. Therefore, following the common practice in LLM evaluation, it is more meaningful to systematically evaluate the reliability of generated reports of SkinGPT-4 by engaging dermatologists in the diagnostic process, allowing them to evaluate the LLM's responses from a clinical perspective by comparing the generated results with the diagnosis made by dermatologists.

Major concern #2
RC: 2. From Figure 4C, we see that a few cases confused the SkinGPT4 algorithm, such as eczema vs. psoriasis/LP, and connective tissue disease vs. acne and rosacea. Can the authors comment on why they think the algorithm had more difficulty with these cases?
AR: We appreciate your insight into the challenging cases illustrated in Figure 4C.
[Response figure: Comparison of connective tissue diseases case 5 in Figure 4c, Dermatomyositis and Rosacea.]
Meanwhile, the training of SkinGPT-4 did not explicitly account for complex skin diseases, which often involve overlapping features of multiple skin diseases, posing challenges even for dermatologists. We acknowledge this limitation and emphasize our commitment to addressing it in future work. Since real-world skin disease images may encompass various skin diseases simultaneously, forming intricate and composite cases, SkinGPT-4 could describe the medical features and potential types of skin diseases in the picture in terms of probability. When multiple skin features are present in the same picture, SkinGPT-4 may be confused. However, it's important to note that such confusion doesn't necessarily indicate inaccuracies in SkinGPT-4 but rather highlights the presence of features potentially associated with multiple skin diseases in the same image. This further explains why we need dermatologists to evaluate the performance of SkinGPT-4 case by case. Tackling these complex scenarios constitutes a vital area for future research, and we recognize it as a key direction for further investigation.
We updated the discussion section in the main text as below:
The diagnosis of intricate skin diseases poses another challenge for SkinGPT-4. In reality, complex skin diseases frequently occur, encompassing a combination of diverse skin diseases exhibiting a multitude of characteristics. Currently, there is a lack of datasets containing multi-label skin disease images along with corresponding dermatologists' diagnoses. Addressing this gap in data constitutes a key focus for future research endeavors to apply SkinGPT-4 in the diagnosis of complex skin diseases.

Major concern #3
RC: 3. Potentially related to the above question: I assume that classic cases of each diagnosis were used in the validation steps. How does the model perform with "edge" cases, such as rashes that seem indeterminate?
AR: Thank you for this excellent question. We recognize the importance of delving into the performance of SkinGPT-4 on "edge" cases and samples exhibiting diverse severity levels.
To address this concern, we specifically tested SkinGPT-4 on different severity levels of Acne, which were classified according to the Chinese guidelines for the treatment of acne (Revised 2019). The results of this evaluation are presented in Figure S2 in the main text (Response Figure 4). Notably, we chose to focus on Acne rather than psoriasis and atopic dermatitis. The rationale for this choice lies in the grading methodologies of the latter diseases, which typically involve assessing the percentage of skin lesions relative to the entire skin area, instead of using just one image. As demonstrated in Figure S2 in the main text (Response Figure 4), SkinGPT-4 still exhibits robust and acceptable performance when presented with skin disease images with differing severity levels of Acne.
Reference:
1. Chinese Expert Group of Acne Treatment Guidelines. Chinese guidelines for the treatment of acne (Revised 2019). J Clin Dermatol. 2019;48(9):583-588. doi: 10.16761/j.cnki.1000-4963.2019.09.020
We added Figure S2 to the supplementary information and updated the discussion section of the main text as below:
Deploying SkinGPT-4 in real-world scenarios may pose potential challenges, particularly due to the variability in patient-submitted images. Factors contributing to this variability include differences in smartphone camera quality, variations in image pre- and post-processing, diverse angles, and varying lighting conditions. Additionally, addressing the diverse severity levels of skin diseases presents another challenge. During the training of SkinGPT-4, we lacked the specific data required to enable the model to identify the severity of skin diseases accurately. Nevertheless, as demonstrated in Figure S2, SkinGPT-4 still exhibits robust and acceptable performance when presented with skin disease images captured under varying angles, lighting conditions, pixel densities, and resolutions with differing severity levels of Acne, which were classified according to the Chinese guidelines for the treatment of acne (Revised 2019). As shown in Figure S3, a guideline for users was also implemented, prompting them to capture images as appropriately as possible. This approach aims to standardize the format of uploaded images, facilitating SkinGPT-4's ability to identify skin disease features effectively.

Major concern #4
RC: 4. There is significant variation with patient-submitted images, especially with quality of smartphone cameras, image pre- and post-processing done on the phone software, angles, and lighting. How was the process standardized, if at all, for the submission of such photos?
AR: Thank you for providing this valuable comment.
We acknowledge the potential challenges associated with variations in patient-submitted images, including differences in smartphone camera quality, image pre- and post-processing, angles, and lighting. To mitigate these challenges, SkinGPT-4 has undergone training on a diverse set of images, enhancing its robustness to various input variations. Meanwhile, the diverse set of user-uploaded images showcased in Figure S2 (Response Figure 4) reflects real-world scenarios and underscores the stability of SkinGPT-4 in handling skin disease images captured under varying angles, lighting conditions, pixel densities, and resolutions even for the same kind of skin disease.
We have also incorporated user guidelines, encouraging users to capture images as appropriately as possible. This strategy is intended to standardize the format of uploaded images, thereby aiding the model in effectively identifying skin disease features, as demonstrated in Figure S3 (Response Figure 5).

AR: We appreciate your concern regarding patient privacy and our commitment to complying with regulations such as HIPAA.
Our institution has a strict data transfer agreement with our collaborators under the KAUST Ref. OTA1363 to ensure data safety.We also strictly adhere to the HIPAA privacy rule during both the model training and local deployment, addressing the use and disclosure of protected health information (PHI).
Throughout the model training phase, the anonymization of both public and proprietary datasets was ensured. Sensitive information, including patient gender, age, name, and nationality, was removed. Additionally, when handling images of certain skin diseases, identifiable biometric features were removed to align with HIPAA standards.
During the local deployment of SkinGPT-4, where the method could be used without an internet connection and retain no patient data, full compliance with HIPAA standards is ensured. Importantly, users utilizing SkinGPT-4 locally are not involved in disclosing any PHI to external entities, thereby adhering steadfastly to the foundational principles outlined by HIPAA.
We updated the dataset section in the main text as below:
Throughout the model training phase, the anonymization of both public and proprietary datasets was ensured. Sensitive information, including patient gender, age, name, and nationality, was removed. Additionally, when handling images of certain skin diseases, identifiable biometric features were removed to align with HIPAA standards. During the local deployment of SkinGPT-4, where the method could be used without an internet connection and retain no patient data, full compliance with HIPAA standards is ensured. Importantly, users utilizing SkinGPT-4 locally are not involved in disclosing any protected health information (PHI) to external entities, thereby adhering steadfastly to the foundational principles outlined by HIPAA.
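The metadata part of the anonymization described above could be sketched roughly as follows. This is a minimal illustration only, not the authors' pipeline: the record layout and the field names (`gender`, `age`, `name`, `nationality`, `image_path`, `description`) are assumptions chosen to mirror the categories listed in the text, and the removal of identifiable biometric features from the images themselves is not covered.

```python
from typing import Any, Dict, Iterable

# Hypothetical field names: the response letter lists the categories that
# were removed (gender, age, name, nationality) but not the actual schema.
SENSITIVE_FIELDS = frozenset({"gender", "age", "name", "nationality"})


def anonymize_record(record: Dict[str, Any],
                     sensitive: Iterable[str] = SENSITIVE_FIELDS) -> Dict[str, Any]:
    """Return a copy of `record` without the sensitive metadata fields.

    Matching is case-insensitive on the key; the input dict is not modified.
    """
    blocked = {key.lower() for key in sensitive}
    return {k: v for k, v in record.items() if k.lower() not in blocked}


# Made-up example record, for illustration only.
example = {
    "image_path": "images/case_0001.jpg",
    "description": "Erythematous papules on the cheek.",
    "Name": "Jane Doe",
    "age": 34,
    "gender": "F",
    "nationality": "N/A",
}
cleaned = anonymize_record(example)
print(sorted(cleaned))
```

Returning a filtered copy rather than deleting keys in place keeps the original record available for auditing until the cleaned version has been verified.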

Minor concern #1
RC: 1. Figure 3: I would suggest including just one panel (e.g. either A or B) for the figure in the main text, and put the other panel in the supplemental information. The text from the chat is too small to be easy to read on a computer or in print.

Figure (chat panels): SkinGPT-4 (step 1 only), MiniGPT-4, Dermatologists; prompt: "What is wrong with my skin?"

4.9. Minor concern #2
RC: 2. For several figures (e.g. Figure 1 and 3), the colors red/green or blue-green are used. I am worried about color-blind readers. Would it be possible to change the color scheme away from red/green? This also makes the text in each chat bubble much harder to read as well.
This paper introduces SkinGPT-4, the first multi-modal LLM-based interactive diagnostic system for dermatology.The authors have developed a new framework by combining a pre-trained vision transformer with an LLM called Falcon-40B-Instruct.To train SkinGPT-4, a two-step training strategy is proposed, utilizing a large collection of skin disease images.
Ghana 1.1, South Africa 3, Botswana 3.3 dermatologists per million population) in comparison with the rest of the world (UK 10, USA 36, Germany 65 dermatologists per million population). (Ref: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9543359/)

2.7. Major concern #6
RC: 6. In Fig 4a, what does the barplot indicate? Is it the percentage of skin disease cases that the dermatologists agree on, or is it the percentage of dermatologists?
AR: Thank you for the valuable comment. We apologize for any confusion resulting from our unclear writing. The barplot in Fig 4a represents the percentage of skin disease cases on which the dermatologists reached an agreement.
We updated the caption of Figure 4 in the main text as below:
Clinical evaluation of SkinGPT-4 by certified offline and online dermatologists. a. Questionnaire-based assessment of SkinGPT-4 by offline dermatologists. The barplot represents the percentage of skin disease cases that the dermatologists agree on. b. Response time of SkinGPT-4 compared to consulting dermatologists online. c. Consistency test of SkinGPT-4's responses. The x-axis indicates test samples, and the y-axis indicates the diagnostic results.

2.8. Major concern #7
RC: 7. When it comes to local deployment, how practical is it? Does it imply that the usability of the framework is restricted to patients who have access to sufficient computational resources? In my opinion, adding more information on local deployment would enhance the quality of the paper.
Illustrated in the figure below is the comparison of connective tissue disease case 5 in Figure 4c, dermatomyositis, and rosacea. SkinGPT-4 exhibited uncertainty in distinguishing connective tissue disease case 5, particularly in discerning between connective tissue disease and rosacea. In comparison, dermatomyositis is an idiopathic autoimmune connective tissue disease characterized by proximal muscle weakness and skin rashes. It is clear that the primary cause of SkinGPT-4's confusion lies in the evident overlap of certain characteristics among these diverse skin diseases.

Figure S2 in the supplementary information. Diagnosis generated by SkinGPT-4 on skin disease images under varying angles, lighting, pixels, and resolution conditions in Acne cases of different severity. Acne classification is proposed following the Chinese guidelines for the treatment of acne (Revised 2019).21

Figure S3 in the supplementary information. Recommended guidelines for submitting qualified skin disease images by users when using SkinGPT-4. The image should be well-defined, allowing for clear visibility of the skin disease area, including its color, shape, and other relevant details visible to the naked eye. The skin disease area is suggested to be positioned at the center of the picture. Examples of qualified and unqualified images are provided.
Figure 1 (in main text). Illustration of SkinGPT-4. SkinGPT-4 is the world's first interactive dermatology diagnostic system based on multimodal large language models. To implement SkinGPT-4, we have designed a new framework that aligned a pre-trained vision transformer with a large language model named Llama-2-13b-chat. SkinGPT-4 was trained on a vast collection (52,929) of both public and in-house skin disease images, accompanied by clinical concepts and doctors' notes. With SkinGPT-4, users could upload their own skin photos for diagnosis, and SkinGPT-4 could autonomously determine the characteristics and categories of skin conditions, perform analysis, provide treatment recommendations, and allow interactive diagnosis. On the right is an example of interactive diagnosis.

Table 1. Detailed metrics for all skin disease classes.
The first public dataset named SKINCON is the first medical dataset densely annotated by domain experts to provide annotations useful across multiple disease processes. SKINCON is a skin disease dataset densely annotated by dermatologists and it includes 3230 images from the Fitzpatrick 17k skin disease dataset densely annotated with 48 clinical concepts as shown in Table 1, 22 of which have at least 50 images representing the concept, and 656 skin disease images from the Diverse

Our private in-house dataset contains 30,187 pairs of skin disease images and corresponding doctors' descriptions. The complete dataset for step 2 training comprises in total of 49,043 pairs of images and textual descriptions as shown in Table 2. All cases underwent diagnoses through standard diagnostic procedures conducted by dermatologists. Simple cases within the dataset have not been confirmed by

Skin of color in dermatology remains poorly represented in machine learning models (and in many dermatology clinics around the world). This has been a deficiency in previous publications on ML in dermatology. Please see and cite https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9374341/. I highly recommend that the authors add their response to this question to the discussion section of the article.
AR: We appreciate your insightful observation and we acknowledge the limitation in our training dataset. Both the public dataset and our proprietary data are completely anonymized, with sensitive information such as patient gender, age, name, and nationality removed to meet the requirements of privacy laws, such as the Biosecurity Law in China and HIPAA in the USA. Therefore, we were unable to present statistical results for demographics.
Figure (chat panels): SkinGPT-4 (step 2 only); prompts: "What is wrong with my skin?" and "Please tell us more about this skin disease, what are the causes?"